Introduction

Abstract

This project investigates the relationship between police incident records, time, and location in San Francisco, California during 2018-2020, through analysis in R Markdown with packages including readr, dplyr, ggplot2, ggrepel, and leaflet. Generally speaking, an obvious decrease since early 2020 can be observed in Assault, Larceny Theft, Lost Property, Non-criminal, Other Miscellaneous, Robbery, and Warrant, while Burglary, Motor Vehicle Theft, and Recovered Vehicle have experienced some increase since early 2020. Crimes are mostly reported during afternoon hours, especially around 5-8 PM on weekdays and around noon on all days, with many crimes also reported around 12 AM on weekends. Arrests occur mostly during 3-4 PM on Tuesdays and 12-2 PM on Wednesdays. In conclusion, the factor that most affects when a crime or arrest happens is the crime category, while other factors such as year, month, and police district have little impact on when crimes happen.

Overview and Motivation

This project aims to investigate the relationship between police incident records, time, and location in San Francisco, California during 2018-2020, based on San Francisco Police Department open data. Since crime has been a major social concern in urban areas, when these unlawful activities happened and were reported can be important for San Francisco as a major U.S. city. Although this project will not reveal why crimes happened at certain times, some assumptions may be offered. The dataset contains police records in San Francisco since 2018, but only data from 2018 to 2020 are included in the analysis. Records missing too many critical values are automatically removed. These police records are used as an indicator of unlawful activity, which may be referred to as crime in what follows.
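As a sketch of that cleaning step (which columns count as critical is an assumption here; this example treats latitude, longitude, and incident category as critical), records with missing critical values could be dropped like this:

```r
library(dplyr)

# Assumption: Latitude, Longitude, and Incident Category are the critical fields.
drop_incomplete <- function(df) {
  df %>% filter(!is.na(Latitude), !is.na(Longitude), !is.na(`Incident Category`))
}

# Toy example: the second row has a missing coordinate and is removed.
toy <- tibble(Latitude  = c(37.77, NA),
              Longitude = c(-122.42, -122.41),
              `Incident Category` = c("Assault", "Robbery"))
nrow(drop_incomplete(toy))  # 1
```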

Questions:

if (!require(readr)) install.packages("readr")
if (!require(dplyr)) install.packages("dplyr")
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(ggrepel)) install.packages("ggrepel")
if (!require(leaflet)) install.packages("leaflet")

Data Preprocessing

Read the data

Examine the dataset.

library(readr)

# path <- "/Users/zhenhuang/Downloads/Final Project/CRIME2018_RECENT.csv"
path <- "/Users/zhenhuang/Downloads/Final Project/Police_Department_Incident_Reports__2018_to_Present.csv"
df <- read_csv(path)
 # str(df)

Preprocess Data

The original column names are a bit redundant. This step simplifies the column names used in the following analysis.

library(tidyverse)
library(dplyr) 
df <- df %>% rename(Date=`Incident Date`,
                     Time = `Incident Time`,
                     Year = `Incident Year`,
                     DayOfWeek = `Incident Day of Week`,
                     Category=`Incident Category`,
                     Descript=`Incident Description`,
                     PdDistrict = `Police District`,
                     Y = Latitude,
                     X = Longitude) %>% 
              mutate(Time = as.character(Time))

 # str(df)

Data Overview

Daily record Trend

The graph indicates that the number of daily police records decreased dramatically during 2020, probably because of COVID-19, when most people stayed at home. The months with the lowest daily numbers of police records are around March to April 2020, when the government announced the quarantine. Generally, the number of daily police records after early 2020 is much lower than before.

library(dplyr)
library(ggplot2)

df_crime_daily <- df %>%
  mutate(Date = as.Date(Date, "%Y/%m/%d")) %>%
  group_by(Date) %>%
  summarize(count = n()) %>%
  arrange(Date)

df_crime_daily %>% 
  ggplot( aes(x = Date, y = count)) +
  geom_line(color = "#4EA8DE", size = 0.1) +
  geom_smooth() +
  labs(x = "Date", y = "Number of Police Records", title = "Daily Police Records in San Francisco (2018–Recent)")

Interactive Map of Crime Incidents

This map shows the locations of crime incidents during 2018-2020. Clicking a marker on the map pops up the incident details.

library(leaflet)
# data <- df
# data <- df %>% filter(Year %in% c(2018, 2019, 2020))
data <- df[1:10000, ]  # Testing only: use the first 10,000 rows (much faster)
data$popup <- paste("<b>Incident #: </b>", data$`Incident Number`, 
                    "<br>", "<b>Category: </b>", data$Category,
                    "<br>", "<b>Description: </b>", data$Descript,
                    "<br>", "<b>Day of week: </b>", data$DayOfWeek,
                    "<br>", "<b>Date: </b>", data$Date,
                    "<br>", "<b>Time: </b>", data$Time,
                    "<br>", "<b>PD district: </b>", data$PdDistrict,
                    "<br>", "<b>Resolution: </b>", data$Resolution,
                    "<br>", "<b>Longitude: </b>", data$X,
                    "<br>", "<b>Latitude: </b>", data$Y) 

data %>% 
  leaflet() %>% 
  addTiles() %>%
  addMarkers(lng = ~X, lat = ~Y, popup = data$popup, clusterOptions = markerClusterOptions()) 

Aggregate Data

Summarize the data by incident category. According to the list of the most frequent categories, Larceny Theft accounts for about 30% of all records, the largest share of any category. The ten most frequent categories, starting from the most frequent, are Larceny Theft, Other Miscellaneous, Malicious Mischief, Non-criminal, Assault, Burglary, Motor Vehicle Theft, Recovered Vehicle, Warrant, and Lost Property.

df_category <- df %>% 
  select(Category) %>% 
  group_by(Category) %>% 
  summarise(Frequency=n()) %>% 
  mutate(Percentage=(Frequency*100 / sum(Frequency))) %>% 
  arrange(desc(Frequency)) 

 head(df_category,20)
## # A tibble: 20 x 3
##    Category                                 Frequency Percentage
##    <chr>                                        <int>      <dbl>
##  1 Larceny Theft                               142622     30.2  
##  2 Other Miscellaneous                          34758      7.35 
##  3 Malicious Mischief                           31074      6.57 
##  4 Non-Criminal                                 29055      6.14 
##  5 Assault                                      28308      5.99 
##  6 Burglary                                     26585      5.62 
##  7 Motor Vehicle Theft                          21894      4.63 
##  8 Recovered Vehicle                            17141      3.62 
##  9 Warrant                                      15508      3.28 
## 10 Lost Property                                14690      3.11 
## 11 Fraud                                        14235      3.01 
## 12 Drug Offense                                 11484      2.43 
## 13 Robbery                                      11071      2.34 
## 14 Missing Person                               10616      2.25 
## 15 Suspicious Occ                                9440      2.00 
## 16 Disorderly Conduct                            8023      1.70 
## 17 Offences Against The Family And Children      6602      1.40 
## 18 Traffic Violation Arrest                      5526      1.17 
## 19 Miscellaneous Investigation                   4456      0.942
## 20 Other Offenses                                4009      0.848

This pie chart shows the percentage of all records that each category accounts for.

df_category %>% ggplot(aes(x="", y=Percentage, fill=Category)) + geom_bar(stat="identity") + coord_polar("y") 

The following is a bar plot of incident categories with high frequency.

df_category %>% 
  filter(Frequency > 20000) %>% 
  ggplot(aes(x= reorder(Category, -Frequency), y=Frequency, fill=Category)) + 
  geom_bar(stat="identity") 

Record number by date and category: according to the graph, Burglary peaked around May 2020 and has increased somewhat since early 2020; Larceny Theft has the largest variance and shows a dramatic decrease since early 2020, similar to the overall trend; Lost Property has also decreased somewhat since early 2020; Warrant had a few peaks during 2019. Generally speaking, an obvious decrease since early 2020 can be observed in Assault, Larceny Theft, Lost Property, Non-criminal, Other Miscellaneous, Robbery, and Warrant, while Burglary, Motor Vehicle Theft, and Recovered Vehicle have experienced some increase since early 2020.

df_var <- df %>%
  mutate(Date = as.Date(Date, "%Y/%m/%d")) %>%
  group_by(Category,Date) %>%
  summarize(count = n()) 

df_var %>%
  ggplot( aes(x = Date, y = count)) +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Date", y = "Count", title = "Number of Police Records in San Francisco by Category and Date (2018-2020)") +
  facet_wrap(~ Category, nrow = 5)

df_var <- df_var %>% 
  spread( Category, count) 

# df_var$Burglary %>% var()
# df_var$`Malicious Mischief` %>% var()
# df_var$`Larceny Theft` %>% var()

Correlation Analysis for Crime Reports

The following graphs aim to explain the relation between time/day of the week and crime records.
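The heatmap code below relies on a few helper objects that are not defined in this excerpt (`get_hour`, `dow_format`, `hour_format`; the data frames `df_whole` and `df_top_crimes` are constructed later, in the Arrest Trend section). A minimal sketch of what those helpers are assumed to be:

```r
# Assumed helpers (not defined in this excerpt); a minimal sketch.

# get_hour extracts the hour (0-23) from an "HH:MM" time string.
get_hour <- function(time_str) as.numeric(substr(time_str, 1, 2))

# Display order for the day-of-week axis.
dow_format <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# 24 hour labels, "12 AM" through "11 PM".
hour_format <- c(paste(c(12, 1:11), "AM"), paste(c(12, 1:11), "PM"))

get_hour("17:45")  # 17
```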

Factor by Crime Category

To further examine crime category as a factor in when crimes are reported, display the heatmap by crime category.

df_whole_time_crime <- df_whole %>%
  filter(Category %in% df_top_crimes$Category) %>%
  mutate(Hour = sapply(Time, get_hour)) %>%
  group_by(Category, DayOfWeek, Hour) %>% 
  summarize(count = n())

df_whole_time_crime$DayOfWeek <- factor(df_whole_time_crime$DayOfWeek, levels = rev(dow_format))
df_whole_time_crime$Hour <- factor(df_whole_time_crime$Hour, levels = 0:23, labels = hour_format)
# head(df_whole_time_crime)

 df_whole_time_crime %>% ggplot( aes(x = Hour, y = DayOfWeek, fill = count)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Time", y = "Day of Week", title = "Number of Crimes in San Francisco by Category and Time (2018-2020)") +
  scale_fill_gradient(low = "#ffffff", high = "#5E60CE") +
  facet_wrap(~ Category, nrow = 6)

The plot above is dominated by Larceny Theft, which has too many records to let the time distribution of the other categories show; the counts therefore need normalization.

# df_whole_time_crime <- df_whole_time_crime %>%
#   group_by(Category) %>%
#   mutate(norm = count/sum(count))

# min-max normalization within each category
df_whole_time_crime %>%
  group_by(Category) %>%
  mutate(norm = (count-min(count))/(max(count)-min(count))) %>% 
  ggplot(aes(x = Hour, y = DayOfWeek, fill = norm)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Time", y = "Day of Week", title = "Number of Crimes in San Francisco by Time, Normalized by Category (2018-2020)") +
  scale_fill_gradient(low = "#ffffff", high = "#5E60CE") +
  facet_wrap(~ Category, nrow = 6)
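As a toy illustration of the min-max rescaling used above (the numbers here are made up), each category's counts are mapped onto [0, 1]:

```r
x <- c(2, 5, 8, 11)
(x - min(x)) / (max(x) - min(x))  # 0, 1/3, 2/3, 1
```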

(Further analysis to be added.)

Factor by Police District

Same as above, but with Police Districts.

# exclude Police Districts that are out of San Francisco
df_whole_time_district <- df_whole %>%
  mutate(Hour = sapply(Time, get_hour)) %>%
  group_by(PdDistrict, DayOfWeek, Hour) %>% 
  summarize(count = n()) %>%
  group_by(PdDistrict) %>%
  filter(!PdDistrict=="Out of SF" ) %>% 
  mutate(norm = (count-min(count))/(max(count)-min(count)))

df_whole_time_district$DayOfWeek <- factor(df_whole_time_district$DayOfWeek, levels = rev(dow_format))
df_whole_time_district$Hour <- factor(df_whole_time_district$Hour, levels = 0:23, labels = hour_format)
# head(df_whole_time_district)

df_whole_time_district %>% 
  ggplot( aes(x = Hour, y = DayOfWeek, fill = norm)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Time", y = "Day of Week", title = "Number of Crimes in San Francisco by Time, Normalized by District (2018-2020)") +
  scale_fill_gradient(low = "#ffffff", high = "#5390D9") +
  facet_wrap(~ PdDistrict, nrow = 7)

Factor by Month

If crime is tied to seasonal activities, the months in which those activities take place may have an impact.

df_whole_time_month <- df_whole %>%
  mutate(Month = format(as.Date(Date, "%Y/%m/%d"), "%B"), Hour = sapply(Time, get_hour)) %>%
  group_by(Month, DayOfWeek, Hour) %>% 
  summarize(count = n()) %>%
  group_by(Month) %>%
  mutate(norm = (count-min(count))/(max(count)-min(count)))

df_whole_time_month$DayOfWeek <- factor(df_whole_time_month$DayOfWeek, levels = rev(dow_format))
df_whole_time_month$Hour <- factor(df_whole_time_month$Hour, levels = 0:23, labels = hour_format)
# Set order of months
df_whole_time_month$Month <- factor(df_whole_time_month$Month,
                                     levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))

df_whole_time_month %>% 
  ggplot( aes(x = Hour, y = DayOfWeek, fill = norm)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Time", y = "Day of Week", title = "Number of Police Records in San Francisco by Time, Normalized by Month (2018-2020)") +
  scale_fill_gradient(low = "#ffffff", high = "#23A9A9") +
  facet_wrap(~ Month, nrow = 6)

Factor By Year

Perhaps things changed over years?

df_whole_time_year <- df_whole %>%
  mutate(Hour = sapply(Time, get_hour)) %>%
  group_by(Year, DayOfWeek, Hour) %>% 
  summarize(count = n()) %>%
  group_by(Year) %>%
  mutate(norm = (count-min(count))/(max(count)-min(count)))

df_whole_time_year$DayOfWeek <- factor(df_whole_time_year$DayOfWeek, levels = rev(dow_format))
df_whole_time_year$Hour <- factor(df_whole_time_year$Hour, levels = 0:23, labels = hour_format)

df_whole_time_year %>%
  ggplot(aes(x = Hour, y = DayOfWeek, fill = norm)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Time", y = "Day of Week", title = "Number of Police Records in San Francisco by Time, Normalized by Year (2018-2020)") +
  scale_fill_gradient(low = "#ffffff", high = "#00A375") +
  facet_wrap(~ Year, nrow = 3)

Arrest Trend

# exclude 2021, since that year is incomplete
# keep only arrest records
df_whole <- df %>% 
  filter(Year %in% c(2018,2019,2020) & grepl("Arrest", Resolution)) 

df_whole %>%
  mutate(Date = as.Date(Date, "%Y/%m/%d")) %>%
  group_by(Date) %>%
  summarize(count = n()) %>%
  arrange(Date) %>%
  ggplot(aes(x = Date, y = count)) +
  geom_line(color = "#4EA8DE", size = 0.1) +
  geom_smooth() +
  labs(x = "Date of Arrest", y = "Number of Arrests", title = "Number of Arrests in San Francisco (2018–2020)")

df_whole_time <- df_whole %>%
  mutate(Hour = sapply(Time, get_hour)) %>%
  group_by(DayOfWeek, Hour) %>%
  summarize(count = n())

df_whole_time$DayOfWeek <- factor(df_whole_time$DayOfWeek, levels = rev(dow_format))
df_whole_time$Hour <- factor(df_whole_time$Hour, levels = 0:23, labels = hour_format)

df_whole_time %>%  
  ggplot(aes(x = Hour, y = DayOfWeek, fill = count)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Time", y = "Day of Week", title = "Number of Arrests in San Francisco by Time (2018-2020)") +
  scale_fill_gradient(low = "#ffffff", high = "#6930c3")

To minimize the effect of arrests with low frequencies, create a heatmap for the top 12 crime categories with the highest frequencies.

df_top_crimes <- df_whole %>%
  group_by(Category) %>% 
  summarize(count = n()) %>%
  arrange(desc(count)) 
df_top_crimes <- df_top_crimes[1:12,]

df_whole_time <- df_whole %>%
  filter(Category %in% df_top_crimes$Category) %>%
  mutate(Hour = sapply(Time, get_hour)) %>%
  group_by(DayOfWeek, Hour) %>%
  summarize(count = n())

df_whole_time$DayOfWeek <- factor(df_whole_time$DayOfWeek, levels = rev(dow_format))
df_whole_time$Hour <- factor(df_whole_time$Hour, levels = 0:23, labels = hour_format)

df_whole_time %>%  
  ggplot(aes(x = Hour, y = DayOfWeek, fill = count)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Time", y = "Day of Week", title = "Number of Frequent Crime Arrests in San Francisco by Time (2018-2020)") +
  scale_fill_gradient(low = "#ffffff", high = "#6930c3")

High-frequency arrests are mostly reported during 3-4 PM on Tuesdays and 12-2 PM on Wednesdays.

Correlation Analysis for Arrest Records

Factor by Crime Category

To further examine crime category as a factor in when arrests happen, display the heatmap by crime category.

df_whole_time_crime <- df_whole %>%
  filter(Category %in% df_top_crimes$Category) %>%
  mutate(Hour = sapply(Time, get_hour)) %>%
  group_by(Category, DayOfWeek, Hour) %>% 
  summarize(count = n())

df_whole_time_crime$DayOfWeek <- factor(df_whole_time_crime$DayOfWeek, levels = rev(dow_format))
df_whole_time_crime$Hour <- factor(df_whole_time_crime$Hour, levels = 0:23, labels = hour_format)
# head(df_whole_time_crime)

df_whole_time_crime %>% ggplot( aes(x = Hour, y = DayOfWeek, fill = count)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Time", y = "Day of Week", title = "Arrests in San Francisco by Category and Time (2018-2020)") +
  scale_fill_gradient(low = "#ffffff", high = "#5E60CE") +
  facet_wrap(~ Category, nrow = 6)

#  # since Fraud and Lost Property have too many missing data, they will be excluded in the following analysis
# df_top_crimes <- df_top_crimes %>% 
#   filter(!Category %in% c("Fraud","Lost Property"))
# 
# #draw again without Fraud and Lost Property
# df_whole_time_crime <- df_whole %>%
#   filter(Category %in% df_top_crimes$Category) %>%
#   mutate(Hour = sapply(Time, get_hour)) %>%
#   group_by(Category, DayOfWeek, Hour) %>% 
#   summarize(count = n())
# 
# df_whole_time_crime$DayOfWeek <- factor(df_whole_time_crime$DayOfWeek, level = rev(dow_format))
# df_whole_time_crime$Hour <- factor(df_whole_time_crime$Hour, level = 0:23, label = hour_format)
# 
# df_whole_time_crime %>% ggplot( aes(x = Hour, y = DayOfWeek, fill = count)) +
#   geom_tile() +
#   theme(axis.text.x = element_text(angle = 90)) +
#   labs(x = "Time", y = "Day of Week", title = "Arrests in San Francisco by Category and Time (2018-2020)") +
#   scale_fill_gradient(low = "#ffffff", high = "#5E60CE") +
#   facet_wrap(~ Category, nrow = 5)

Normalization.

# min-max normalization within each category
df_whole_time_crime %>%
  group_by(Category) %>%
  mutate(norm = (count-min(count))/(max(count)-min(count))) %>% 
  ggplot(aes(x = Hour, y = DayOfWeek, fill = norm)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Time", y = "Day of Week", title = "Arrests in San Francisco by Time, Normalized by Category (2018-2020)") +
  scale_fill_gradient(low = "#ffffff", high = "#5E60CE") +
  facet_wrap(~ Category, nrow = 6)

(Further analysis to be added.)

Factor by Police District

Same as above, but with Police Districts.

# exclude Police Districts that are out of San Francisco
df_whole_time_district <- df_whole %>%
  mutate(Hour = sapply(Time, get_hour)) %>%
  group_by(PdDistrict, DayOfWeek, Hour) %>% 
  summarize(count = n()) %>%
  group_by(PdDistrict) %>%
  filter(!PdDistrict=="Out of SF" ) %>% 
  mutate(norm = (count-min(count))/(max(count)-min(count)))

df_whole_time_district$DayOfWeek <- factor(df_whole_time_district$DayOfWeek, levels = rev(dow_format))
df_whole_time_district$Hour <- factor(df_whole_time_district$Hour, levels = 0:23, labels = hour_format)
# head(df_whole_time_district)

df_whole_time_district %>% 
  ggplot( aes(x = Hour, y = DayOfWeek, fill = norm)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Time", y = "Day of Week", title = "Arrests in San Francisco by Time, Normalized by District (2018-2020)") +
  scale_fill_gradient(low = "#ffffff", high = "#5390D9") +
  facet_wrap(~ PdDistrict, nrow = 7)